This document reports the code for discovering COVID-19 meta-subgroups by comorbidities and habits, evaluated on the Mexican Government dataset published by the Mexican government. This document, as well as the supporting functions required for its execution, are available in the COVID-19 Metaclustering GitHub repository.
The COVID-19 infectious disease has led since December 2019 to a global pandemic that remains under control measures. Researchers around the world are making great efforts for a comprehensive understanding of COVID-19 from various data sources and modalities, with the goal of improving prognosis and treatments.
This work shows the results of a two-stage unsupervised Machine Learning (ML) approach, namely meta-clustering, to identify potential subphenotypes of COVID-19 patients from clinical phenotypes and demographic features. Stratification on gender and age groups was included to reduce potential confounding factors since age and gender are highly correlated with comorbidity, habits and mortality. It is worth noting that although our meta-clustering approach in this tutorial is based on age-gender stratification, in real-world there are other intriguing variable combinations that we may be interested in that can also be replicated using this code with subtle changes. For example, considering the socioeconomic level of the city where the patients live, or separating rural and non-rural areas to obtain two different sets of clusters (rural clusters and non-rural clusters) since the socioeconomic level of the city affects the amount of resources that are available for patients and thereby affecting the outcome of patients.
This work is part of the COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool) project from the Biomedical Data Science Lab of the Universitat Politècnica de València, Spain.
We analyzed the COVID-19 open data from the Mexican Government that is released on 2020-11-02. After processing the data with proper inclusion criteria, the final sample comprises patient-level epidemiological and clinical data from 778 692 SARS-CoV-2 patients from January 13, 2020, to September 30, 2020. The methods for the dataset preprocessing is described in CovMexico_Data_Preprocessing.ipynb in our COVID-19 Metaclustering GitHub repository as well.
Inter-patient variability was analyzed by combining dimensionality reduction and hierarchical clustering methods. We produced cluster analyses for all combinations of gender and age groups (<18, 18-49, 50-64, and >64). For each group, the optimum number of clusters in this work were selected combining a quantitative approach using Silhouette coefficient, and a qualitative method through a subgroup expert inspection via visual analytics. Using the features of the resultant age-gender clusters, we performed a meta-clustering analysis to provide an overall description of the population.
In this tutorial, we found 56 age-gender specific clusters with a Multiple Correspondence Analysis 3-dimensional and a hierarchical clustering. Afterwards, we applied PCA with these age-gender clusters and then used the first 3 Principal Components as input to apply meta-clustering. Finally, we found eleven meta-clusters that provide bases to comprehend classification of patients with COVID-19 based on comorbidities, habits, demographic characteristics, geographic data and type of clinical institutions, as well as revealing the correlations between the above characteristics thereby help to anticipate the possible clinical outcomes for every specifically characterized patient. These subphenotypes can establish target groups for automated stratification or triage systems to provide personalized therapies or treatments.
All the codes can be replicated easily and adapted on many different kinds of datasets such as patient-level or individual-level dataset.
If you use this code please cite:
Lexin Zhou, Nekane Romero, Juan MartÃnez-Miranda, J Alberto Conejero, Juan M GarcÃa-Gómez, Carlos Sáez. Heterogeneity in COVID-19 severity patterns among age-gender groups: an analysis of 778 692 Mexican patients through a meta-clustering technique. Preprint submitted to medRxiv.
If you are interested in collaborating in this work please contact us.
For further exploration of the results, please visit our COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool). For more practical examples, you may also be interested in our preceding work about COVID-19 subgroup discovery tool, application to the nCov2019 datase whose methodology is well described in another GitHub repository COVID-19-SDE-Tool.
Install and load the required packages
# the function p_load() from pacman package checks to see if a package is installed, if not it attempts to install the package and then loads it
# By this way, we simplify the notebook's structure
# install.packages("pacman")
library(pacman)
pacman::p_load(writexl, ggplot2, ggfortify, text2vec, plotly, factoextra, dplyr, stringr, kableExtra, MVN, lsa, Rtsne, FactoMineR, ape, RColorBrewer, survival, survminer, varhandle, glue, cluster, fastcluster, graphics, clusterCrit, ComplexHeatmap, reshape, pheatmap)Source the required functions. These can be found at the COVID-19 Metaclustering GitHub repository. We recommended to download the whole repository, which includes this document .Rmd file.
Load the nCov2019 dataset at 2020-11-02.
### LOAD AND FORMAT DATASET
untar('./data/COVID19_MEXICO_LATEST_DATA.tar.gz', exdir = "data")
data = read.csv2("./data/COVID19_MEXICO_LATEST_DATA.csv", encoding="UTF-8", sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE,dec=".",
colClasses = c( "numeric", #Index of row
"Date", #LAST_UPDATE
"character", #ID_REGISTRATION
"factor", #ORIGIN
"factor", #SECTOR
"character", #ENTITY_UM
"factor", #SEX
"character", #ENTITY_NAT
"character", #ENTITY_RES
"character", #MUNICIPALITY_RES
"factor", #PATIENT_TYPE
"Date", #ADMISSION_DATE
"Date", #SYMPTOMS_DATE
"Date", #DEATH_DATE
"numeric", #INTUBATED
"numeric", #PNEUMONIA
"numeric", #AGE
"character", #NATIONALITY
"numeric", #PREGNANT
"numeric", #SPEAK_INDIGENOUS_LANGUAGE
"numeric", #Indigenous
"numeric", #DIABETES
"numeric", #COPD
"numeric", #ASTHMA
"numeric", #INMUSUPR
"numeric", #HYPERTENSION
"numeric", #OTHER_DISEASE
"numeric", #CARDIOVASCULAR
"numeric", #OBESITY
"numeric", #CHRONIC_KIDNEY_DISEASE
"numeric", #SMOKE
"factor", #OTHER_CASE_CONTACT
"factor", #TESTED
"factor", #RESULT_LAB
"factor", #FINAL_CLASIFICATION
"factor", #MIGRANT
"character", #COUNTRY_NATIONALITY
"character", #COUNTRY_ORIGIN
"factor", #UCI
"factor", #OUTCOME ('See the Python notebook')
"numeric", #Survival_days ('See the Python notebook')
"character" #From_symptoms_to_hospital_days ('See the Python notebook')
)) # Select only positive patients
data = data[data$RESULT_LAB == 'Positive SARS-CoV-2', ]
# data = data[data$RESULT_LAB == 'Non-Positive SARS-CoV-2', ]
# convert some variable types
data$ageNum = as.numeric(data$AGE)
data$Survival_days = as.numeric(data$Survival_days)
data$FromSymptomToHospital_days = as.numeric(data$FromSymptomToHospital_days)
data$ageCat = vector(mode = "character", length = nrow(data))
names(data)[names(data) == "UCI"] <- "ICU" # Translate UCI (Spanish) to ICU (Intensive Care Unit)
# make age groups <17, 18-49, 50-64, >65
idsLeq17 = data$ageNum <= 17
ids18to49 = data$ageNum > 17 & data$ageNum <= 49
ids50to64 = data$ageNum > 49 & data$ageNum <= 64
idsGeq65 = data$ageNum > 64
data$ageCat[idsLeq17] = '<18'
data$ageCat[ids18to49] = '18-49'
data$ageCat[ids50to64] = '50-64'
data$ageCat[idsGeq65] = '>64'
# Manually fix variable types
data$ADMISSION_DATE = as.Date(data$ADMISSION_DATE)
data$LAST_UPDATE = as.Date(data$LAST_UPDATE)
data$SYMPTOMS_DATE = as.Date(data$SYMPTOMS_DATE)
data$DEATH_DATE = as.Date(data$DEATH_DATE)
# remove rows with missing Admission dates and remove index column
data = data[!is.na(data$ADMISSION_DATE),-1]
# remove rows: LAST_UPDATE and ID_REGISTRATION due to their irrelevance
data = subset(data, select = -c(LAST_UPDATE, ID_REGISTRATION))
# Create new columns for categorizing survival days
data$'Survival>15days' <- ifelse(data$Survival_days>15, 1, 0)
data$'Survival>30days' <- ifelse(data$Survival_days>30, 1, 0)
data$'Survival>15days'[is.na(data$'Survival>15days')]<-1 # Fill nan-value
data$'Survival>30days'[is.na(data$'Survival>30days')]<-1
data_outcome = data[which(!is.na(data$OUTCOME)),]
data_outcome = data_outcome[!is.na(data_outcome$ageNum),]
# creates binary variables for age range.
age_rangeVector <- to.dummy(data$ageCat, "Age")
data_outcome = cbind(data_outcome, age_rangeVector)
names(data_outcome)[names(data_outcome) == "Age.<18"] <- "Age<18"
names(data_outcome)[names(data_outcome) == "Age.18-49"] <- "Age18-49"
names(data_outcome)[names(data_outcome) == "Age.50-64"] <- "Age50-64"
names(data_outcome)[names(data_outcome) == "Age.>64"] <- "Age>64" ### It's worth noting that these vectors should be changed if the user is interested in replicating the analysis with different dataset with different variables.
name <- c()
clusterSize <- c()
Age<- c()
Gender <- c()
Pregnant <- c()
Obesity <- c()
Smoke <- c()
Pneumonia <- c()
Diabetes <- c()
COPD <- c()
Asthma <- c()
INMUSUPR <- c()
Hypertension <- c()
Other_Disease <- c()
Cardiovascular <- c()
CKD <- c()
Recovery <- c()
Survival_days <- c()
FifteenSurvival <- c() # For all patients
ThirtySurvival <- c() # For all patients
FifteenSurvival_deceased <- c() # Only deceased patients
ThirtySurvival_deceased <- c() # Only deceased patients
Symptoms_to_hospitalization_days <- c()
Hospitalized <- c()
ICU <- c()
Intubated <- c()
Other_case_contact <- c()
Age_mean <- c()You may consider to change the number k for your own analysis by the following chunks. In this tutorial we select different number of k for different age-gender groups.
ages >64 years:
ages between 50-64 years:
ages between 18-49 years:
ages < 18 years:
Create List for the Silhouette values in all age-gender groups (not used in this tutorial) since we only used them to illustrate Silhouette Plot in our COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool)
this may take a long time depending on the sample size. Our dataset (n = 778 692) lasted roughly 3 hours
Genders <- c("Male", "Female") # M=Male, F=Female
Age_cat <- c("<18", "18-49", "50-64", ">64")
for (ageRange in Age_cat){
for (sex in Genders) {
### Filtering the data and do some modification in the data frame for MCA usage
dataExperiment = data_outcome[data_outcome$ageCat == ageRange,]
dataExperiment = dataExperiment[dataExperiment$SEX == sex,]
bestk = Bestk[[ageRange]][[sex]]
# print(ageRange)
# print(sex)
### Some preparation for MCA (Multiple Correspondence Analysis)
# vector for comorbidities
chronicsVector = dataExperiment %>% select(13, 19,20,21,22,23,24, 25, 27)
# vector for habits
featuresVector = dataExperiment %>% select(26,28)
featuresVector = data.frame(featuresVector)
colnames(featuresVector) = paste0("F_",colnames(featuresVector))
chronicsVector = data.frame(chronicsVector)
colnames(chronicsVector) = paste0("C_",colnames(chronicsVector))
data_analysis = cbind(featuresVector,chronicsVector)
data_analysis = sapply(data_analysis, as.logical)
### Select the variables that we sought to analyze
data_analysis_metadata = dataExperiment[,c("ageNum", "ageCat", "SEX", "ENTITY_RES","NATIONALITY","SMOKE", "SECTOR","PREGNANT","OUTCOME","ICU","INTUBATED","SECTOR","OBESITY","OTHER_CASE_CONTACT","Survival_days","PATIENT_TYPE","FromSymptomToHospital_days","Age<18","Age18-49","Age50-64","Age>64",'Survival>15days','Survival>30days')]
### Perform Multiple Correspondence Analysis on 3 dimensions
res.mca = MCA(data_analysis, ncp = 3, graph = TRUE)
# fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
ind = res.mca$ind
var = res.mca$var
### Perform clustering
clusterdistEuclideanMCA <- hclust.vector(ind$coord, method="ward", members=NULL, metric='euclidean', p=NULL) # ward is equivalent to ward.D2 in the library fastcluster
groups <- cutree(clusterdistEuclideanMCA, k=bestk)
sizes <- dataExperiment$ageNum
resultsClustering = list("ind" = ind, "clusterdistEuclideanMCA" = clusterdistEuclideanMCA, "k" = bestk, "groups" = groups, "sizes" = sizes)
### Calculate Silhouette Coefficients for each age-gender cluster
Silhouette <- intCriteria(ind$coord, resultsClustering$groups, c("Silhouette"))
SilhoutteL[[ageRange]][[sex]] = Silhouette
### selection of true habits and comorbidities values from MCA
res.mca$call$marge.col
trues = endsWith(names(res.mca$call$marge.col), "TRUE")
coord_T = var$coord[trues,]
rownames(coord_T) = substr(rownames(coord_T),1,nchar(rownames(coord_T))-5)
coord_T_F = coord_T[1:ncol(featuresVector),]
rownames(coord_T_F) = substr(rownames(coord_T_F),3,nchar(rownames(coord_T_F)))
coord_T_C = coord_T[-(1:ncol(featuresVector)),]
rownames(coord_T_C) = substr(rownames(coord_T_C),3,nchar(rownames(coord_T_C)))
### Calculate statistics
uniqueGroups = unique(resultsClustering$groups)
# change a bit the colnames' format (optional)
colnames(data_analysis) = substr(colnames(data_analysis),3,nchar(colnames(data_analysis)))
colnames(data_analysis) = make.names(colnames(data_analysis), unique = TRUE)
colnames(data_analysis) = str_replace_all(colnames(data_analysis),"_", " ")
for (i in 1:length(uniqueGroups)){
if(sex=='Male') {
sex_abbre = 'M'
} else {
sex_abbre = 'F'
}
name <- append(name, paste(ageRange, sex_abbre,i, sep="")) # The name of the age-gender cluster. I.e: '18-49M7'
patientGroupIdx = resultsClustering$groups %in% uniqueGroups[i]
nPatientsGroup = sum(patientGroupIdx)
clusterSize <- append(clusterSize, nPatientsGroup) # The 'n' of the cluster
# comorbidity statistics
data_analysis_subgroup = data_analysis[patientGroupIdx,,drop = FALSE]
nind = nrow(data_analysis_subgroup)
data_analysis_subgroupT = t(data_analysis_subgroup)
resultsS = sapply(data.frame(data_analysis_subgroup), function(x) Prop(x)*100) ### The proportion results of input variables
Obesity <- append(Obesity, resultsS[1])
Smoke <- append(Smoke, resultsS[2])
Pneumonia <- append(Pneumonia, resultsS[3])
Diabetes <- append(Diabetes, resultsS[4])
COPD <- append(COPD, resultsS[5])
Asthma <- append(Asthma, resultsS[6])
INMUSUPR <- append(INMUSUPR, resultsS[7])
Hypertension <- append(Hypertension, resultsS[8])
Other_Disease <- append(Other_Disease, resultsS[9])
Cardiovascular <- append(Cardiovascular, resultsS[10])
CKD <- append(CKD, resultsS[11])
Age<- append(Age, ageRange)
# sex, age, recovered statistics
data_analysis_metadata_subgroup = data_analysis_metadata[patientGroupIdx,]
# Age range statistics
Age17Stats = Prop(data_analysis_metadata_subgroup$'Age<18' == 1)*100
Age18Stats = Prop(data_analysis_metadata_subgroup$'Age18-49' == 1)*100
Age50Stats = Prop(data_analysis_metadata_subgroup$'Age50-64' == 1)*100
Age65Stats = Prop(data_analysis_metadata_subgroup$'Age>64' == 1)*100
# Gender and pregnancy statistics
femaleStats = Prop(as.character(data_analysis_metadata_subgroup$SEX) %in% 'Female')*100
PregnantStats = Prop(data_analysis_metadata_subgroup$PREGNANT == 1)*100
# ageMean = mean(data_analysis_metadata_subgroup$ageNum)
ageStats = Mean(data_analysis_metadata_subgroup$ageNum)
# Outcome statistics
recoveredStats = Prop(data_analysis_metadata_subgroup$OUTCOME == 'Non-Deceased')*100
survivalStats = Mean(data_analysis_metadata_subgroup$Survival_days[data_analysis_metadata_subgroup$OUTCOME == 'Deceased'], na.rm = TRUE)
Survival15Stats = Prop(data_analysis_metadata_subgroup$'Survival>15days' == 1)*100
Survival30Stats = Prop(data_analysis_metadata_subgroup$'Survival>30days' == 1)*100
aux <- data_analysis_metadata_subgroup[data_analysis_metadata_subgroup$OUTCOME == "Deceased",]
Survival15Stats_deceased = Prop(aux$'Survival>15days' == 1)*100
Survival30Stats_deceased = Prop(aux$'Survival>30days' == 1)*100
SympToHostStats = Mean(data_analysis_metadata_subgroup$FromSymptomToHospital_days, na.rm = TRUE)
ICUStats = Prop(data_analysis_metadata_subgroup$ICU == 1)*100
intubStats = Prop(data_analysis_metadata_subgroup$INTUBATED == 1)*100
Patient_TypeStats = Prop(data_analysis_metadata_subgroup$PATIENT_TYPE == 'HOSPITALIZED')*100
Case_ContactStats = Prop(data_analysis_metadata_subgroup$OTHER_CASE_CONTACT == 1)*100
### Append the Stats to vectors
Pregnant <- append(Pregnant, PregnantStats)
Age_mean <- append(Age_mean, ageStats)
Recovery <- append(Recovery, recoveredStats)
Survival_days <- append(Survival_days, survivalStats)
FifteenSurvival <- append(FifteenSurvival, Survival15Stats)
ThirtySurvival <- append(ThirtySurvival, Survival30Stats)
FifteenSurvival_deceased <- append(FifteenSurvival_deceased, Survival15Stats_deceased)
ThirtySurvival_deceased <- append(ThirtySurvival_deceased, Survival30Stats_deceased)
Symptoms_to_hospitalization_days <- append(Symptoms_to_hospitalization_days, SympToHostStats)
Hospitalized <- append(Hospitalized, Patient_TypeStats)
ICU <- append(ICU, ICUStats)
Intubated <- append(Intubated, intubStats)
Other_case_contact <- append(Other_case_contact, Case_ContactStats)
Gender <- append(Gender, sex)
}
}
}df <- data.frame(name, Obesity, Smoke, Pneumonia, Diabetes, COPD, Asthma, INMUSUPR,
Hypertension, Other_Disease, Cardiovascular, CKD, clusterSize, Age, Gender,
Pregnant, Recovery, Survival_days, FifteenSurvival, ThirtySurvival, FifteenSurvival_deceased, ThirtySurvival_deceased, Symptoms_to_hospitalization_days,
Hospitalized, ICU, Intubated, Other_case_contact, Age_mean)
### Each row comprises the calculated features of each age-gender group
rownames(df) <- df$name
data_analysis_metadata = df# vector for comorbidities
chronicsVector = data_analysis_metadata %>% dplyr::select(4,5,6,7,8,9,10,11,12)
# vector for habits
featuresVector = data_analysis_metadata %>% dplyr::select(2,3)
featuresVector = data.frame(featuresVector)
colnames(featuresVector) = paste0("F_",colnames(featuresVector))
chronicsVector = data.frame(chronicsVector)
colnames(chronicsVector) = paste0("C_",colnames(chronicsVector))
data_analysis = cbind(featuresVector,chronicsVector) # Will be used later in clustering
pca_df <- data_analysis_metadata[2:12]
pca<-(prcomp(pca_df)) # Apply PCA algorithm
### make a scree plot (loadings)
a = fviz_screeplot(pca, addlabels = TRUE, ylim = c(0, 45))
### plot pc1 and pc2
p <- autoplot(pca, data=data_analysis_metadata, colour="Age",frame=TRUE,loadings = TRUE, loadings.colour = 'blue', label = TRUE, label.size = 5, main='2D PCA from 56 COVID-19 subgroups',
loadings.label = TRUE, loadings.label.size = 5)+
theme_bw()k.### Analyze the Silhouette Coefficients with plot for meta-clustering
SilhouetteCoefficients = c()
for (i in 2:12){
bestk = i
ind = pca$x[, 1:3] # equivalent to ind$coord
clusterdistEuclideanPCA = hclust.vector(ind, method="ward", members=NULL, metric='euclidean', p=NULL) #Equivalent to hclust but faster with vector data.
groups <- cutree(clusterdistEuclideanPCA, k=bestk)
sizes <- data_analysis_metadata$Age
resultsClustering = list("ind" = pca$x[,1:3], "clusterdistEuclideanPCA" = clusterdistEuclideanPCA, "k" = bestk, "groups" = groups, "sizes" = sizes)
a = intCriteria(pca$x[,1:3], resultsClustering$groups, c("Silhouette"))
SilhouetteCoefficients <- append(SilhouetteCoefficients, a[[1]])
}
plot<-plot(SilhouetteCoefficients, type='o',col = "blue", xlab = "Number of clusters (k)", ylab = "Silhouette Coefficient",
main = "Silhouette Coefficient Plot")### We select k=11 for meta-clustering
bestk = 11
ind = pca$x[, 1:3] # equivalent to ind$coord in MCA
clusterdistEuclideanPCA = hclust.vector(ind, method="ward", members=NULL, metric='euclidean', p=NULL) # using Ward’s minimum variance method with Euclidean squared distance
groups <- cutree(clusterdistEuclideanPCA, k=bestk)
sizes <- data_analysis_metadata$Age
resultsClustering = list("ind" = pca$x[,1:3], "clusterdistEuclideanPCA" = clusterdistEuclideanPCA, "k" = bestk, "groups" = groups, "sizes" = sizes)Perform first some preparation to facilitate visualizations.
# selection of true symptoms and comorbidities values from MCA
trues = endsWith(names(res.mca$call$marge.col), "TRUE")
coord_T = var$coord[trues,]
rownames(coord_T) = substr(rownames(coord_T),1,nchar(rownames(coord_T))-5)
coord_T_F = coord_T[1:ncol(featuresVector),]
rownames(coord_T_F) = substr(rownames(coord_T_F),3,nchar(rownames(coord_T_F)))
coord_T_C = coord_T[-(1:ncol(featuresVector)),]
rownames(coord_T_C) = substr(rownames(coord_T_C),3,nchar(rownames(coord_T_C)))
# configuration of scatter properties
markerSize2d = 35
markerSize2ddiamond = 25
markerSize3d = 300
markerSize3ddiamond = 250
axx <- list(
title = "1<sup>st</sup> component"
)
axy <- list(
title = "2<sup>nd</sup> component"
)
axz <- list(
title = "3<sup>rd</sup> component"
)
### Hide grid in order to obtain high resolution plot without harsh grid lines after exporting to html file
axx_no_grid <- list(
title = "1<sup>st</sup> component", showgrid = FALSE
)
axy_no_grid <- list(
title = "2<sup>nd</sup> component", showgrid = FALSE
)
# color resources
colorsVars = brewer.pal(n = 2, name = "BrBG")In the following results, the information about the subgroups is consistent within scatter plots, detailed table, pattern discovery using LOESS model, and correspond to the subgroups found in the previous application of clustering to the selected Age-gender group with the corresponding number of clusters k.
Subgroups scatter
is.num <- sapply(data_analysis_metadata, is.numeric)
data_analysis_metadata[is.num] <- lapply(data_analysis_metadata[is.num], round, 2)
ax <- list(
title = "",
showgrid = FALSE
)
pClusters2d <- plot_ly(x = pca$x[,1], y = pca$x[,2],
color = as.factor(paste("Subgroup",format(resultsClustering$groups, digits=2))),
colors = brewer.pal(n = bestk, name = "Set3"),
size=7,
text = data_analysis_metadata$name,
marker = list(sizemode = 'area'),
scene = 'sceneClusters') %>%
layout(title = "Subgroups (obtained at 3D)",
xaxis = ax,
yaxis = ax) %>%
config(displaylogo = FALSE)
t <- list(
family = "sans serif",
size = 15,
showgrid=FALSE,
color = toRGB("grey50"))
pClusters2d <- pClusters2d %>% add_text(textfont = t)
pClusters2dpClusters3d <- plot_ly(x = pca$x[,1], y = pca$x[,2], z = pca$x[,3],
color = as.factor(paste("Subgroup",format(resultsClustering$groups, digits=2))),
colors = brewer.pal(n = length(unique(resultsClustering$groups)), name = "Set2"),
size = I(markerSize3d), type = "scatter3d", mode = "markers",
text = paste0("Patient ID: ",rownames(pca$x),"\n Age:",resultsClustering$sizes),
marker = list(sizemode = 'area'),
scene = 'sceneClusters') %>%
layout(title = "Subgroups", scene = list(xaxis=axx,yaxis=axy,zaxis=axz)) %>%
config(displaylogo = FALSE)
t <- list(
family = "sans serif",
size = 8,
color = toRGB("grey50"))
pClusters3dalphaci = 0.05
uniqueGroups = unique(resultsClustering$groups)
resultsTable = data.frame(matrix(nrow = ncol(data_analysis)+19, ncol = length(uniqueGroups)))
colnames(data_analysis) = substr(colnames(data_analysis),3,nchar(colnames(data_analysis)))
colnames(data_analysis) = make.names(colnames(data_analysis), unique = TRUE)
colnames(data_analysis) = str_replace_all(colnames(data_analysis),"_", " ")
# Note: this (whihout make.names) returns an error when same text is in symptoms and comorbidities, a solution might be using a column for names, or avoid repated terms
rownames(resultsTable) <- c(sprintf('No. of individuals (n<sub>total</sub> = %d)',nrow(data_analysis)), colnames(data_analysis),"Age<17","Age18-49","Age50-64","Age>65",'PREGNANT', 'Females','Age','Recovered','Survival days (deceased)','Survival>15days','Survival>30days', 'Survival>15days_deceased','Survival>30days_deceased','Symptoms to hospitalization days','Hospitalized','ICU','INTUBATED','Other case contact')
colnames(resultsTable) <- paste('Subgroup',uniqueGroups)
for (i in 1:length(uniqueGroups)){
patientGroupIdx = resultsClustering$groups %in% uniqueGroups[i]
nPatientsGroup = sum(patientGroupIdx)
# comorbidity statistics
data_analysis_subgroup = data_analysis[patientGroupIdx,,drop = FALSE]
nind = nrow(data_analysis_subgroup)
data_analysis_subgroupT = t(data_analysis_subgroup)
resultsS = sapply(data.frame(data_analysis_subgroup), function(x) mCI(x, alphaci))
subgroupResultColumn = apply(resultsS, 2, function(x) sprintf('%.2f (%.2f-%.2f)',x[1],x[2],x[3]))
# sex, age, recovered statistics
data_analysis_metadata_subgroup = data_analysis_metadata[patientGroupIdx,]
#Age range stats
Age17Stats = pCI(data_analysis_metadata_subgroup$Age == '<=17', alphaci)*100
Age18Stats = pCI(data_analysis_metadata_subgroup$Age == '18-49', alphaci)*100
Age50Stats = pCI(data_analysis_metadata_subgroup$Age == '50-64', alphaci)*100
Age65Stats = pCI(data_analysis_metadata_subgroup$Age == '>=65', alphaci)*100
femaleStats = pCI(data_analysis_metadata_subgroup$Gender=='Female', alphaci)*100
PregnantStats = mCI(data_analysis_metadata_subgroup$Pregnant, alphaci)
ageStats = mCI(data_analysis_metadata_subgroup$Age_mean, alphaci)
recoveredStats = mCI(data_analysis_metadata_subgroup$Recovery, alphaci)
survivalStats = mCI(data_analysis_metadata_subgroup$Survival_days, alphaci, na.rm = TRUE)
SympToHostStats = mCI(data_analysis_metadata_subgroup$Symptoms_to_hospitalization_days, alphaci, na.rm = TRUE)
ICUStats = mCI(data_analysis_metadata_subgroup$ICU, alphaci)
intubStats = mCI(data_analysis_metadata_subgroup$Intubated, alphaci)
Patient_TypeStats = mCI(data_analysis_metadata_subgroup$Hospitalized, alphaci)
Case_ContactStats = mCI(data_analysis_metadata_subgroup$Other_case_contact, alphaci)
Survival15Stats = mCI(data_analysis_metadata_subgroup$FifteenSurvival,alphaci, na.rm = TRUE) ### Both deceased and non-deceased patients
Survival30Stats = mCI(data_analysis_metadata_subgroup$ThirtySurvival,alphaci, na.rm = TRUE)
Survival15_deceasedStats = mCI(data_analysis_metadata_subgroup$FifteenSurvival_deceased,alphaci, na.rm = TRUE) ### Percetages among deceased patients
Survival30_deceasedStats = mCI(data_analysis_metadata_subgroup$ThirtySurvival_deceased,alphaci, na.rm = TRUE)
# compile final result column
subgroupResultColumn = c(as.character(nPatientsGroup), subgroupResultColumn, sprintf('%.2f (%.2f-%.2f)',Age17Stats[1],Age17Stats[2],Age17Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age18Stats[1],Age18Stats[2],Age18Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age50Stats[1],Age50Stats[2],Age50Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age65Stats[1],Age65Stats[2],Age65Stats[3]), sprintf('%.2f
(%.2f-%.2f)', PregnantStats[1], PregnantStats[2],PregnantStats[3]),sprintf('%.2f (%.2f-%.2f)',femaleStats[1],femaleStats[2],femaleStats[3]), sprintf('%.2f (%.2f-%.2f)',ageStats[1],ageStats[2],ageStats[3]), sprintf('%.2f (%.2f-%.2f)',recoveredStats[1],recoveredStats[2],recoveredStats[3]))
subgroupResultColumn = c(subgroupResultColumn, sprintf('%.2f (%.2f-%.2f)',survivalStats[1],survivalStats[2],survivalStats[3]),
sprintf('%.2f (%.2f-%.2f)',Survival15Stats[1],Survival15Stats[2],Survival15Stats[3]),
sprintf('%.2f (%.2f-%.2f)',Survival30Stats[1],Survival30Stats[2],Survival30Stats[3]),
sprintf('%.2f (%.2f-%.2f)',Survival15_deceasedStats[1],Survival15_deceasedStats[2],Survival15_deceasedStats[3]),
sprintf('%.2f (%.2f-%.2f)',Survival30_deceasedStats[1],Survival30_deceasedStats[2],Survival30_deceasedStats[3]),
sprintf('%.2f (%.2f-%.2f)',SympToHostStats[1],SympToHostStats[2],SympToHostStats[3]))
subgroupResultColumn = c(subgroupResultColumn,sprintf('%.2f (%.2f-%.2f)',Patient_TypeStats[1],Patient_TypeStats[2],Patient_TypeStats[3]), sprintf('%.2f (%.2f-%.2f)',ICUStats[1],ICUStats[2],ICUStats[3]),sprintf('%.2f (%.2f-%.2f)',intubStats[1],intubStats[2],intubStats[3]), sprintf('%.2f (%.2f-%.2f)',Case_ContactStats[1],Case_ContactStats[2],Case_ContactStats[3]))
resultsTable[i] <- subgroupResultColumn
}
# resultsTable$variables <- rownames(resultsTable)
# write_xlsx(resultsTable,"C:\\Users\\Lexin Zhou\\11Meta_subgroups_table.xlsx")
# names(resultsTable) <- cell_spec(names(resultsTable), background = "yellow")
tTable = kable(resultsTable, format = "html", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
pack_rows("Features (%|x, CI 95%)", 2, 1+ncol(featuresVector)) %>%
pack_rows("Comorbidities (%|x, CI 95%)", 2+ncol(featuresVector), 2+ncol(featuresVector)+ncol(chronicsVector)-1) %>%
pack_rows("Age range (%|x, CI 95%)", 2+ncol(featuresVector)+ncol(chronicsVector), 2+ncol(featuresVector)+ncol(chronicsVector)+3)%>%
pack_rows("Demographics (%|x, CI 95%)", 6+ncol(featuresVector)+ncol(chronicsVector), 2+ncol(featuresVector)+ncol(chronicsVector)+6) %>%
pack_rows("Outcomes (%|x, CI 95%)", 2+ncol(featuresVector)+ncol(chronicsVector)+7, 2+ncol(featuresVector)+ncol(chronicsVector)+7+8+2)
groupColors = brewer.pal(n = ncol(resultsTable), name = "Set3")
for (i in 1:ncol(resultsTable)) {
tTable = tTable %>% column_spec(i+1, background = groupColors[i], include_thead = TRUE) %>%
column_spec(i+1, background = "inherit")
}
tTable| Subgroup 1 | Subgroup 2 | Subgroup 3 | Subgroup 4 | Subgroup 5 | Subgroup 6 | Subgroup 7 | Subgroup 8 | Subgroup 9 | Subgroup 10 | Subgroup 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of individuals (ntotal = 56) | 8 | 6 | 8 | 8 | 3 | 7 | 4 | 3 | 5 | 3 | 1 |
| Features (%|x, CI 95%) | |||||||||||
| Obesity | 0.44 (-0.42-1.29) | 11.77 (1.77-21.77) | 59.87 (37.37-82.38) | 12.01 (4.36-19.67) | 75.54 (58.77-92.30) | 18.89 (14.98-22.81) | 20.15 (11.99-28.30) | 19.05 (3.44-34.66) | 25.94 (-1.77-53.66) | 50.51 (39.69-61.34) | 23.99 (NA-NA) |
| Smoke | 0.00 (0.00-0.00) | 9.68 (-0.96-20.31) | 34.09 (11.26-56.92) | 0.80 (-0.77-2.36) | 10.77 (2.19-19.36) | 8.10 (4.96-11.24) | 4.38 (0.40-8.37) | 38.03 (10.93-65.13) | 0.22 (-0.22-0.67) | 42.02 (21.36-62.68) | 76.85 (NA-NA) |
| Comorbidities (%|x, CI 95%) | |||||||||||
| Pneumonia | 12.36 (0.06-24.67) | 37.00 (10.24-63.76) | 9.08 (1.88-16.28) | 37.18 (28.28-46.08) | 41.52 (29.64-53.40) | 42.44 (33.60-51.28) | 52.14 (49.38-54.90) | 43.55 (35.90-51.21) | 48.10 (42.38-53.83) | 46.80 (34.44-59.15) | 53.61 (NA-NA) |
| Diabetes | 0.00 (0.00-0.00) | 4.42 (-0.68-9.52) | 4.50 (1.09-7.91) | 39.06 (20.82-57.29) | 57.14 (46.70-67.59) | 35.62 (26.64-44.60) | 76.44 (72.73-80.15) | 20.45 (9.76-31.14) | 95.00 (88.20-101.79) | 61.22 (48.65-73.80) | 31.96 (NA-NA) |
| COPD | 0.00 (0.00-0.00) | 4.51 (-0.36-9.38) | 0.00 (0.00-0.00) | 0.73 (-0.70-2.16) | 0.00 (0.00-0.00) | 5.10 (1.29-8.91) | 2.03 (-0.36-4.42) | 43.91 (26.55-61.28) | 2.36 (-2.27-7.00) | 37.46 (18.06-56.87) | 91.86 (NA-NA) |
| Asthma | 0.37 (-0.35-1.09) | 3.20 (1.00-5.40) | 18.17 (4.07-32.28) | 1.15 (-0.33-2.63) | 2.03 (0.84-3.22) | 2.69 (1.72-3.65) | 0.49 (-0.34-1.32) | 25.72 (5.32-46.11) | 0.08 (-0.08-0.25) | 19.79 (6.44-33.14) | 19.63 (NA-NA) |
| INMUSUPR | 0.00 (0.00-0.00) | 13.03 (-1.95-28.01) | 0.10 (-0.10-0.31) | 1.40 (-1.34-4.13) | 0.00 (0.00-0.00) | 40.38 (26.73-54.03) | 0.00 (0.00-0.00) | 0.91 (-0.88-2.70) | 0.00 (0.00-0.00) | 0.03 (-0.00-0.06) | 0.00 (NA-NA) |
| Hypertension | 0.00 (0.00-0.00) | 9.13 (1.05-17.21) | 7.59 (0.71-14.48) | 41.15 (29.72-52.59) | 68.79 (55.80-81.77) | 46.79 (37.71-55.88) | 83.71 (79.07-88.35) | 34.38 (28.21-40.55) | 96.33 (90.32-102.34) | 77.86 (60.66-95.06) | 52.94 (NA-NA) |
| Other Disease | 0.00 (0.00-0.00) | 38.32 (14.19-62.46) | 0.30 (-0.28-0.88) | 1.22 (-0.85-3.28) | 0.00 (0.00-0.00) | 48.63 (31.41-65.84) | 1.85 (-0.45-4.15) | 1.73 (-1.66-5.13) | 0.00 (0.00-0.00) | 0.82 (-0.37-2.02) | 0.00 (NA-NA) |
| Cardiovascular | 0.00 (0.00-0.00) | 17.52 (6.00-29.04) | 0.10 (-0.10-0.31) | 2.46 (0.06-4.86) | 2.17 (-2.08-6.41) | 14.25 (8.52-19.99) | 21.64 (5.73-37.54) | 4.73 (-1.62-11.07) | 5.52 (-5.30-16.33) | 26.51 (25.30-27.73) | 27.77 (NA-NA) |
| CKD | 0.00 (0.00-0.00) | 4.27 (-2.34-10.87) | 0.00 (0.00-0.00) | 3.87 (-3.69-11.42) | 0.22 (-0.21-0.65) | 31.84 (12.44-51.23) | 81.68 (67.22-96.14) | 1.04 (-0.84-2.91) | 1.92 (-1.84-5.68) | 1.28 (0.67-1.90) | 1.01 (NA-NA) |
| Age range (%|x, CI 95%) | |||||||||||
| Age<17 | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) |
| Age18-49 | 25.00 (0.00-55.01) | 33.33 (0.00-71.05) | 50.00 (15.35-84.65) | 25.00 (0.00-55.01) | 66.67 (13.32-100.00) | 28.57 (0.00-62.04) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) |
| Age50-64 | 25.00 (0.00-55.01) | 0.00 (0.00-0.00) | 37.50 (3.95-71.05) | 25.00 (0.00-55.01) | 33.33 (0.00-86.68) | 42.86 (6.20-79.52) | 50.00 (1.00-99.00) | 33.33 (0.00-86.68) | 40.00 (0.00-82.94) | 33.33 (0.00-86.68) | 0.00 (0.00-0.00) |
| Age>65 | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) | 0.00 (0.00-0.00) |
| Demographics (%|x, CI 95%) | |||||||||||
| PREGNANT | 0.49 (-0.19-1.17) | 1.28 (-0.03-2.59) | 0.30 (-0.09-0.70) | 0.80 (-0.24-1.83) | 0.33 (-0.31-0.97) | 0.26 (-0.25-0.77) | 0.01 (-0.01-0.03) | 0.00 (0.00-0.00) | 0.01 (-0.00-0.01) | 0.00 (0.00-0.00) | 0.00 (NA-NA) |
| Females | 50.00 (15.35-84.65) | 50.00 (9.99-90.01) | 50.00 (15.35-84.65) | 50.00 (15.35-84.65) | 33.33 (0.00-86.68) | 42.86 (6.20-79.52) | 50.00 (1.00-99.00) | 33.33 (0.00-86.68) | 60.00 (17.06-100.00) | 66.67 (13.32-100.00) | 0.00 (0.00-0.00) |
| Age | 43.41 (26.04-60.77) | 17.98 (5.69-30.27) | 39.80 (29.24-50.36) | 44.76 (27.37-62.15) | 46.37 (36.38-56.35) | 56.34 (45.22-67.47) | 65.32 (56.46-74.17) | 68.74 (57.36-80.12) | 66.80 (59.25-74.34) | 68.25 (57.47-79.03) | 76.43 (NA-NA) |
| Outcomes (%|x, CI 95%) | |||||||||||
| Recovered | 90.27 (80.69-99.85) | 91.37 (88.49-94.25) | 95.21 (90.75-99.67) | 82.81 (73.99-91.63) | 81.30 (72.82-89.78) | 66.94 (56.11-77.77) | 53.96 (47.85-60.08) | 66.43 (57.01-75.85) | 64.95 (57.34-72.56) | 64.42 (47.39-81.44) | 55.96 (NA-NA) |
| Survival days (deceased) | 13.61 (12.88-14.35) | 12.83 (11.49-14.17) | 14.22 (13.72-14.71) | 13.26 (12.22-14.31) | 13.33 (12.93-13.73) | 12.67 (12.01-13.34) | 11.84 (11.47-12.20) | 13.46 (12.91-14.00) | 12.94 (12.50-13.37) | 13.27 (12.78-13.76) | 12.10 (NA-NA) |
| Survival>15days | 93.46 (86.96-99.95) | 93.73 (91.36-96.10) | 97.01 (94.27-99.75) | 88.39 (82.57-94.21) | 87.27 (81.71-92.82) | 76.34 (68.52-84.16) | 65.37 (60.62-70.11) | 77.09 (70.86-83.32) | 75.26 (69.45-81.08) | 75.34 (63.34-87.35) | 67.28 (NA-NA) |
| Survival>30days | 90.74 (81.61-99.87) | 91.80 (89.06-94.54) | 95.50 (91.30-99.69) | 83.74 (75.27-92.21) | 82.14 (74.04-90.24) | 68.26 (57.71-78.80) | 55.71 (49.70-61.71) | 68.20 (59.28-77.11) | 66.33 (58.87-73.79) | 65.87 (49.36-82.39) | 56.96 (NA-NA) |
| Survival>15days_deceased | 30.76 (26.80-34.72) | 28.64 (23.96-33.33) | 36.21 (33.89-38.52) | 31.09 (25.90-36.27) | 31.59 (30.22-32.96) | 28.45 (26.02-30.89) | 24.80 (23.33-26.27) | 31.70 (28.72-34.68) | 29.73 (27.83-31.63) | 31.01 (28.71-33.31) | 25.71 (NA-NA) |
| Survival>30days_deceased | 6.61 (4.50-8.71) | 4.64 (1.88-7.41) | 5.93 (3.77-8.09) | 5.79 (3.02-8.56) | 4.52 (4.22-4.83) | 4.20 (3.42-4.99) | 3.82 (3.35-4.29) | 5.26 (5.02-5.49) | 4.04 (3.32-4.75) | 4.24 (3.53-4.94) | 2.29 (NA-NA) |
| Symptoms to hospitalization days | 3.78 (2.90-4.66) | 3.20 (2.22-4.17) | 4.87 (4.46-5.27) | 4.37 (3.65-5.09) | 5.21 (4.94-5.47) | 4.48 (4.18-4.78) | 4.30 (4.12-4.49) | 4.85 (4.69-5.02) | 4.92 (4.81-5.04) | 4.94 (4.70-5.17) | 4.82 (NA-NA) |
| Hospitalized | 19.87 (5.44-34.30) | 46.08 (25.71-66.44) | 14.15 (5.71-22.60) | 42.22 (34.68-49.75) | 44.91 (34.80-55.03) | 58.56 (47.64-69.48) | 70.72 (67.24-74.20) | 57.17 (47.81-66.53) | 60.80 (55.26-66.34) | 60.11 (43.18-77.03) | 70.47 (NA-NA) |
| ICU | 1.59 (0.41-2.77) | 9.82 (3.74-15.89) | 1.23 (0.27-2.20) | 4.48 (3.34-5.61) | 5.06 (3.32-6.80) | 4.01 (3.20-4.82) | 4.87 (4.04-5.69) | 4.81 (3.69-5.92) | 5.56 (4.86-6.26) | 5.24 (3.25-7.23) | 5.62 (NA-NA) |
| INTUBATED | 3.44 (0.30-6.58) | 9.03 (4.72-13.34) | 2.18 (0.31-4.06) | 7.90 (5.63-10.18) | 8.46 (5.20-11.72) | 12.12 (8.74-15.50) | 13.38 (12.30-14.45) | 11.50 (9.14-13.86) | 12.13 (10.29-13.96) | 12.42 (6.55-18.29) | 12.84 (NA-NA) |
| Other case contact | 45.84 (36.85-54.84) | 40.23 (33.11-47.36) | 51.18 (46.11-56.25) | 36.60 (32.48-40.71) | 36.04 (32.92-39.16) | 27.39 (22.47-32.32) | 20.90 (18.49-23.31) | 27.88 (24.16-31.60) | 27.56 (24.86-30.26) | 28.00 (21.06-34.94) | 20.89 (NA-NA) |
The heatmap also specifies the meta-cluster that each age-gender cluster belongs
df1 = subset(data_analysis_metadata, select = -c(name, Age, Gender, clusterSize, Survival_days))
df1 <- df1 %>%
dplyr::rename(
"Survival>15days" = FifteenSurvival,
"Survival>30days" = ThirtySurvival,
"Survival>15days (deceased)" = FifteenSurvival_deceased,
"Survival>30days (deceased)" = ThirtySurvival_deceased
)
# make ClusterSize groups <17, 18-49, 50-64, >65
less500 = df$clusterSize < 500
less2000 = df$clusterSize < 2000 & df$clusterSize >= 500
less5000 = df$clusterSize < 5000 & df$clusterSize >= 2000
less10000 = df$clusterSize < 10000 & df$clusterSize >= 5000
less30000 = df$clusterSize < 30000 & df$clusterSize >= 10000
more30000 = df$clusterSize > 30000
aux <- df
aux$clusterSizeCat[less500] = '<500'
aux$clusterSizeCat[less2000] = '500-1999'
aux$clusterSizeCat[less5000] = '2K-4999'
aux$clusterSizeCat[less10000] = '5K-9999'
aux$clusterSizeCat[less30000] = '10K-29999'
aux$clusterSizeCat[more30000] = '>=30000'
Clus_Size <- aux$clusterSizeCat
names(Clus_Size) <- df$name
ClusterGroup <- resultsClustering$groups
ClusterGroup <- as.factor(ClusterGroup)
names(ClusterGroup) <- df$name
### Create annotation rows
my_annotation_row = data.frame(
Cluster_size = Clus_Size,
Cluster_subgroup = ClusterGroup
)
### Create annotation rows' colors
my_colour = list(
Cluster_size = brewer.pal(9, "YlGn")[1:6],
Cluster_subgroup = brewer.pal(n = 12, name = "Set3")[1:bestk]
)
names(my_colour$Cluster_size) <- c('<500','500-1999','2K-4999','5K-9999','10K-29999','>=30000')
names(my_colour$Cluster_subgroup) <- c(1:bestk)
### Order the dataframe to group by the meta-clusters
df1$cluster <- resultsClustering$groups
df1 <- df1 %>% arrange(cluster)
df1$cluster <- NULL # Remove it because we do not want it appears twice (as numeric variables)
### Plot
### For the following pheatmap function, it is worth noting that we did not use "display_numbers = T" parameter to illustrate the exact figure (probability) of each box since the heatmap is too small which made it impossible to visualize. When user uses this tutorial we recommend to include this parameter if the user wants to generate the heatmap with numbers.
final_Heat2 <- pheatmap(df1, name='The remain variables', main='HeatMap of the 56 age-gender specific clusters among 11 meta-subgroups', angle_col = c('315'), cluster_cols=F,
annotation_row = my_annotation_row,
annotation_colors = my_colour, cluster_rows = F)
final_Heat2### Redo final_Heat2 by the order of clusters' age (optional)
target <- c("<18F1","<18F2","<18F3","<18F4",
"<18M1","<18M2","<18M3","<18M4","<18M5",
"18-49F1","18-49F2","18-49F3","18-49F4","18-49F5","18-49F6","18-49F7",
"18-49M1","18-49M2","18-49M3","18-49M4","18-49M5","18-49M6","18-49M7",
"50-64F1","50-64F2","50-64F3","50-64F4","50-64F5","50-64F6","50-64F7","50-64F8",
"50-64M1","50-64M2","50-64M3","50-64M4","50-64M5","50-64M6","50-64M7","50-64M8","50-64M9",
">64F1",">64F2",">64F3",">64F4",">64F5",">64F6",">64F7",">64F8",
">64M1",">64M2",">64M3",">64M4",">64M5",">64M6",">64M7",">64M8")
df2 <- df1[match(target, rownames(df1)),]
### Use display_numbers = T if we want to show the exact number of each box
final_Heat3 <- pheatmap(df2, name='The remain variables', main='HeatMap of the 56 age-gender specific clusters among 11 meta-subgroups', angle_col = c('315'), cluster_cols=F,cluster_rows = F)
final_Heat3generateLOESS <- function(pca, Y, Y_name) {
# pca is the result of pca whose PC1 and PC2 will be used to estimate severity variable that we want to predict
# Y is the vector that comprises the values of severity variable (e.g. data_analysis_metadata$Mortality)
# Y_name is just the name of variable Y. We will use this for legend purpose.
df <- data.frame(pca$x[,1], pca$x[,2], Y)
colnames(df) <- c('pca1','pca2','YVal')
data.loess <- loess(YVal ~ pca1 + pca2, data = df) ### Predicting Y using PC1 and PC2
graph_reso <- 0.5
xgrid <- seq(min(df$pca1), max(df$pca1), by = graph_reso)
ygrid <- seq(min(df$pca2), max(df$pca2), by = graph_reso)
data.fit <- expand.grid(pca1 = xgrid,
pca2 = ygrid)
mtrx3d <- predict(data.loess, newdata = data.fit)
# plot0 <- contour(x = xgrid, y = ygrid, z = mtrx3d, xlab = "pca1", ylab = "pca2")
# Transform data to long form
mtrx.melt <- melt(mtrx3d, id.vars = c("pca1", "pca2"), measure.vars = YVal)
names(mtrx.melt) <- c("pca1", "pca2", 'YVal')
### Eliminate negative values
mtrx.melt$YVal <- ifelse(mtrx.melt$YVal < 0, 0, mtrx.melt$YVal)
# Return data to numeric form
mtrx.melt$pca1 <- as.numeric(str_sub(mtrx.melt$pca1, str_locate(mtrx.melt$pca1, "=")[1,1] + 1))
mtrx.melt$pca2 <- as.numeric(str_sub(mtrx.melt$pca2, str_locate(mtrx.melt$pca2, "=")[1,1] + 1))
# Create ten segments to be colored in
mtrx.melt$equalSpace <- cut(mtrx.melt$YVal, 7)
# Sort the segments in ascending order
breaks <- levels(unique(mtrx.melt$equalSpace))
# Plot
plot3 <- ggplot(mtrx.melt, aes(x=pca1,
y=pca2, z=YVal)) +
geom_tile(data = mtrx.melt, aes( x=pca1,
y=pca2, fill = equalSpace)) +
geom_point(data = df, aes(x = pca1, y = pca2), shape = 1, size =8, color='red')+
# geom_text(data=df, aes(label=Y_name),hjust=0, vjust=0, size=5)+
geom_contour(color = "white", alpha = 0.5,size=0) +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
xlab("PCA1") +
ylab("PCA2") +
scale_fill_manual(values = c("#80cdc1", "#c7eae5", "#f5f5f5",
"#f6e8c3", "#dfc27d", "#bf812d", "#8c510a"),
name = Y_name, breaks = breaks, labels = breaks)
return (plot3)
}We studied Mortality, ICU, Intubation, Hospitalization, symptoms to hospital, Survival15, Survival30,Survival15_deceased, Survival30_deceased.
### Create Mortality variable based on Recovery rate
data_analysis_metadata$Mortality <- 100 - data_analysis_metadata$Recovery
# Mortality
generateLOESS(pca, data_analysis_metadata$Mortality, 'Mortality')# Symptoms_to_hospitalization_days
generateLOESS(pca, data_analysis_metadata$Symptoms_to_hospitalization_days, 'Symptoms_to_hospitalization_days')# Survival>15days (%)
generateLOESS(pca, data_analysis_metadata$FifteenSurvival, 'Survival>15days (%)')# Survival>30days (%)
generateLOESS(pca, data_analysis_metadata$ThirtySurvival, 'Survival>30days (%)')# Survival>15days_deceased (%)
generateLOESS(pca, data_analysis_metadata$Intubated, 'Survival>15days_deceased (%)')# Survival>30days_deceased (%)
generateLOESS(pca, data_analysis_metadata$ThirtySurvival_deceased, 'Survival>30days_deceased (%)')